ABSTRACT

For text retrieval systems, the assumption that all data 
structures reside in main memory is increasingly common. 
In this context, we present a novel incremental inverted in- 
dexing algorithm for web-scale collections that directly con- 
structs compressed postings lists in memory. Designing effi- 
cient in-memory algorithms requires understanding modern 
processor architectures and memory hierarchies: in this pa- 
per, we explore the issue of postings lists contiguity. Natu- 
rally, postings lists that occupy contiguous memory regions 
are preferred for retrieval, but maintaining contiguity in- 
creases complexity and slows indexing. On the other hand, 
allowing discontiguous index segments simplifies index con- 
struction but decreases retrieval performance. Understand- 
ing this tradeoff is our main contribution: We find that co- 
locating small groups of inverted list segments yields query 
evaluation performance that is statistically indistinguishable 
from fully-contiguous postings lists. In other words, it is not 
necessary to lay out in-memory data structures such that all 
postings for a term are contiguous; we can achieve ideal per- 
formance with a relatively small amount of effort. 

1. INTRODUCTION 

For text retrieval applications today, it is reasonable to 
assume that all index structures fit in main memory, so that 
query evaluation can avoid hitting disk altogether. In indus- 
try, this is a practical requirement given users' expectations 
of low latency responses and the operational demands of 
high throughput to serve many concurrent users [10]. In the
academic literature, there has been work on query evaluation
using main-memory indexes [31], and the assumption of
holding all index structures in memory is now common [23| 
[JJ, enabled by the falling prices of hardware. Servers ca- 
pable of holding web-scale collections in memory are within 
the reach of most academic researchers today. 

In this paper, we explore incremental (sometimes referred 
to as "online") inverted indexing algorithms in main mem- 
ory for modern web-scale collections. Our rationale is that 
if indexes are going to be served from memory, we should be 
able to build indexes in memory also, provided that the ad- 
ditional "working space" required by the indexer is modest. 
We describe a novel indexing algorithm for incrementally 
building compressed postings lists directly in memory. Of 
course, incremental indexing is not a new research topic, 
but most previous work assumes that the index will not fit 
in memory and must reside on disk. Our assumption puts 
us in a different, underexplored region of the design space. 

Frequently, indexing algorithms encode a tradeoff between
indexing and retrieval performance. Our study specifically
examines the issue of postings list contiguity, which manifests
such a tradeoff. By contiguity we mean whether each
postings list occupies a single block of memory or is split
into multiple segments placed at different memory locations.
Why does contiguity matter? From the retrieval perspec- 
tive, we would expect an impact on query evaluation speed: 
traversing postings lists that occupy a contiguous block of 
memory takes advantage of cache locality and processor pre- 
fetching, whereas discontiguous postings lists suffer from 
cache misses due to pointer chasing. However, from the 
indexing perspective, maintaining contiguous postings lists 
introduces substantial complexity, for example, requiring ei- 
ther a two-pass approach, eager over-allocation of memory, 
or repeatedly copying postings when they grow beyond a 
certain size. With each of these techniques we would ex- 
pect indexing performance to suffer. Thus, a faster, simpler 
indexing algorithm that does not attempt to maintain post- 
ings list contiguity may result in slower query evaluation. It 
is this tradeoff that we seek to understand in more detail. 

This paper has two main contributions: First, we present a
novel in-memory incremental indexing algorithm with sev- 
eral desirable features: it is fast, scales to modern web-scale 
collections, and takes advantage of best practice index com- 
pression techniques. Second, in the context of this indexing 
algorithm, we explore the impact of postings lists contiguity 
on indexing and query evaluation performance (both con- 
junctive and disjunctive). We find that small discontiguous 
inverted list segments do indeed cause a drop in query eval- 
uation speed, but that co-locating small groups of index seg- 
ments yields performance that is statistically indistinguish- 
able from fully-contiguous postings lists (which are difficult 
to maintain in an online setting). In other words, we can 
achieve ideal performance with a relatively small amount of 
effort. This somewhat surprising result is explained in the 
context of modern processor architectures. To our knowl- 
edge, we are the first to explore this issue in the context of 
main-memory indexes. 

2. BACKGROUND AND RELATED WORK 
2.1 Processor Architectures 

The performance characteristics of rotational magnetic
disks (slow seeks but good throughput) are well understood
by the IR community, and previous disk-based algorithms
are specifically adapted to these characteristics. Memory,
however, exhibits a different set of performance characteristics
that are less discussed in the community. Therefore, we
begin with a high-level overview of modern processor architectures
as they pertain to the algorithms discussed here.

Compared to the multi-core revolution in computing, a
less-discussed but equally important trend over the past two
decades is the so-called "memory wall" [3], where increases in
processor speed have far outpaced improvements in memory
latency. This means that RAM is becoming slower relative
to the CPU. In the 1980s, memory latencies were on the order
of a few clock cycles; today, they can be several hundred
clock cycles. To hide this latency, computer architects have
introduced hierarchical cache memories: a typical server today
will have L1, L2, and L3 caches between the processor
and main memory. The fraction of memory accesses that
can be fulfilled by the cache is called the cache hit rate, and
data not found in cache is said to cause a cache miss. Cache
misses cascade down the hierarchy: if a datum is not found
in L1, the processor tries to look for it in L2, then in L3, and
finally in main memory (paying an increasing latency cost
each level down). The key point is that memory latencies are
not uniform, and can actually vary by orders of magnitude
(comparing an L1 cache hit vs. accessing main memory).

Managing cache content is a complex challenge, but there 
are two main principles relevant to a software developer. 
First, caches are organized into cache lines (typically 64
bytes), the smallest unit of transfer between cache
levels. That is, when a program accesses a particular memory
location, the entire cache line is brought into (L1) cache.
Subsequent references to nearby memory locations are very 
fast, and therefore, it is worthwhile to organize data struc- 
tures to take advantage of this fact. Second, if a program 
accesses memory in a predictable sequential pattern (called 
striding), the processor will pre-fetch memory blocks and
move them into cache, before the program has explicitly re- 
quested the memory locations. This means that predictable 
memory access patterns (e.g., traversing contiguous postings 
lists) are critical to high performance. 

Another salient property of modern CPUs is pipelining, 
where instruction execution is split between many stages. 
At each clock cycle, all instructions "in flight" advance one 
stage in the pipeline; new instructions enter the pipeline and 
instructions that leave the pipeline are "retired". Pipeline 
stages allow faster clock rates since there is less to do per 
stage. Modern superscalar CPUs add the ability to dispatch 
multiple instructions per clock cycle (and out of order) pro- 
vided that they are independent. 

The implication of this is that pointer chasing, which oc- 
curs when we try to locate the next segment of a discontigu- 
ous postings list, is slow due to what is called a data hazard 
in VLSI design terminology, when one instruction requires 
the result of another. When dereferencing pointers, we must 
first compute the memory location to access. Subsequent 
instructions cannot proceed until we know what memory 
location is needed — the processor simply stalls waiting for 
the result. That is, no dependent instructions can enter the 
pipeline, and given memory latencies discussed above, this 
delay can be many cycles. Thus, accessing arbitrary memory 
locations in RAM is not very efficient — this parallels the re- 
lationship between disk seeks and scans, but of course, with 
a completely different underlying physical model. 
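
To make the contrast concrete, the following is a minimal C sketch (not taken from our indexer) of the two access patterns discussed above. The type `node_t` and the functions `sum_array` and `sum_list` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node type for a linked (discontiguous) list of integers. */
typedef struct node {
    uint32_t value;
    struct node *next;
} node_t;

/* Contiguous scan: every address is known up front, so the access
 * pattern is sequential and amenable to hardware pre-fetching. */
uint64_t sum_array(const uint32_t *values, size_t n) {
    assert(values != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += values[i];
    }
    return total;
}

/* Pointer chasing: the address of the next node is only known once
 * the current node has been loaded, so the loads are serialized. */
uint64_t sum_list(const node_t *head) {
    uint64_t total = 0;
    for (const node_t *cur = head; cur != NULL; cur = cur->next) {
        total += cur->value;
    }
    return total;
}
```

Both functions compute the same result; only the memory access pattern differs, and any performance gap between them depends on how the nodes happen to be laid out in memory.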

In the context of information retrieval, there is one ad- 
ditional complexity worth noting. Following best practice, 
we use PForDelta [40, 36] for compressing postings lists.
Since it is a block-based technique (i.e., integers are coded
in groups, typically 128), decompression yields memory access
patterns that differ from techniques which code one integer
at a time (e.g., variable-length integers, γ codes, etc.).
Our experiments show that this has the effect of masking
memory latencies and cache misses.

2.2 Incremental Indexing 

In this paper, we only examine standard inverted indexes — 
mappings from terms to postings lists, where each posting 
holds the document id, term frequency, and term positions. 
We set aside alternatives such as bit signatures [39], recent
work on self-indexes [26], as well as the approach of Lempel
et al. [17], who eschew inverted indexes completely.

As previously mentioned, most previous work on incremental
indexing assumes that postings lists do not fit in
memory and ultimately must be organized on disk. The 
design space of indexing strategies is nicely illustrated by 
Tomasic et al. [32], who examined the problem of index up- 
dates: how to append an in-memory list M to a list L on disk. 
We summarize only the important results here. Tomasic et 
al. explored different disk allocation policies: with the con- 
stant approach, a constant amount of space is reserved at the 
end of every list for new postings. In contrast, the propor- 
tional strategy reserves empty space at the end proportional 
to the number of postings being written to disk; thus, longer 
lists have more room to grow. Complementary to these al- 
location policies is the update strategy. If the in-memory 
list to be written fits into the reserved space, then the on- 
disk list is updated in place. Otherwise, the authors discuss 
different options: whole, which combines the in-memory and 
on-disk list and writes the result to a new location, thereby 
maintaining a contiguous list; and new, which writes the in- 
memory list to a new location, thus creating a linked list of 
segments. Not surprisingly, experiments show that the new 
strategy is quicker for index updates since there is no need 
to copy data, while the whole strategy is preferred for query 
evaluation since it reduces the number of disk seeks needed 
when traversing postings. 
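
The allocation policies and update strategies summarized above can be sketched in a few lines of C. This is only an illustration of the decision logic as we have described it, not code from Tomasic et al.; the names `reserved_slots` and `choose_update`, and the choice to express the proportion as an integer percentage, are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { POLICY_CONSTANT, POLICY_PROPORTIONAL } alloc_policy_t;
typedef enum { UPDATE_IN_PLACE, UPDATE_WHOLE, UPDATE_NEW } update_action_t;

/* Free slots to reserve at the end of a list after writing `written`
 * postings. `param` is either the constant reserve (in postings) or
 * the proportion expressed in percent (an assumption of this sketch). */
size_t reserved_slots(alloc_policy_t policy, size_t written, size_t param) {
    switch (policy) {
    case POLICY_CONSTANT:
        return param;
    case POLICY_PROPORTIONAL:
        return written * param / 100;
    }
    assert(0 && "unknown allocation policy");
    return 0;
}

/* Decides how to handle an in-memory list of `incoming` postings:
 * update in place if it fits in the reserved space, otherwise fall
 * back to the configured overflow strategy (whole or new). */
update_action_t choose_update(size_t free_slots, size_t incoming,
                              update_action_t overflow) {
    assert(overflow == UPDATE_WHOLE || overflow == UPDATE_NEW);
    return (incoming <= free_slots) ? UPDATE_IN_PLACE : overflow;
}
```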

Other researchers explored different choices that can be 
understood in terms of the general strategies described above. 
For example, Brown et al. [5] proposed allocating space in 
powers of two, up to a maximum ($2^4, 2^5, \ldots, 2^{13}$). If there is
enough space at the end of the current on-disk list to accom- 
modate the in-memory postings, the in-memory postings are 
appended in place. Otherwise, a larger chunk is allocated 
and the contents of the old block are moved to the new one, 
with the new postings appended to its end. In another work, 
Shieh and Chung [29] elaborate on over-allocation strategies 
that take into account different statistics (e.g., space usage 
and update request rate). One additional finding supported 
by multiple studies is the importance of separately handling 
"short" and "long" postings, for example, by storing short 
postings directly in the dictionary [9] or in "buckets".

After a burst of activity in the early to mid 1990s, there
was a lull in work on incremental indexing until a series of 
papers by Lester et al. [18, 19]. Their basic strategy was to
first perform in-memory inversion within a bounded buffer,
for example, using the technique of Heinz and Zobel [15].
Inevitably (under the assumption of limited memory), postings
must be flushed to disk. Lester et al. outlined three
options for what to do once memory is exhausted: rebuild
the index on disk from scratch (not very practical), modify
postings in place on disk (practical only for small updates),
or selectively merge in-memory and on-disk segments and
rewrite to another region on disk. In particular, they ex- 
plored a geometric partitioning and hierarchical merging 
strategy that limits the number of outstanding partitions, 
thereby controlling query costs. The same basic idea was described
at around the same time by Büttcher et al. [7], who
called their approach "logarithmic merge". Both approaches 
were subsequently generalized by Guo et al. [13], who in- 
troduced a method to dynamically adjust the sequence of 
segment merges. Recently, researchers have applied some 
of these techniques to solid state disks (SSDs) [21], which
manifest performance characteristics that are different from 
both RAM and disk; however, a full discussion of SSDs is 
beyond the scope of this work. 

Using the basic buffer-and-flush approach, Margaritis and 
Anastasiadis [24] make a different design choice: when the 
in-memory buffer reaches capacity, instead of flushing the 
entire in-memory index, they choose to flush only a por- 
tion of the term space (a contiguous range of terms based 
on lexicographic sort order), performing a merge with the 
corresponding on-disk portions of the inverted lists. The 
advantage of this is that it does not lead to a proliferation 
of index segments, compared to Lester et al. [19] . 

The above review focuses on incremental indexing, but of 
course, there has been a lot of work on indexing in general; 
see [38] for a survey. One way to ensure contiguous postings
lists is to adopt a two-pass approach [12, 35, 14], which is
impractical for incremental indexing. Moffat and Bell [25] 
proposed a single-pass, sort-based approach (later improved 
by Heinz and Zobel [15]): in their method, whenever memory 
is exhausted, the in-memory postings are flushed to disk 
as separate index segments. A final post-processing step 
merges these individual segments into a single index. Again, 
this approach is unsuitable for incremental indexing. 

In terms of work specifically focused on in-memory indexing,
Luk and Lam [22] proposed an in-memory storage
allocation scheme based on variable-size linked lists. How- 
ever, it is unclear if the approach is scalable: they only report 
experiments on old TREC collections that are over an order 
of magnitude smaller than the ones we explore here. Fur- 
thermore, their work used a relatively inefficient technique 
for postings compression (variable-length integers) and does 
not build full positional indexes, as we do. 

Recently, Busch et al. [6] detailed the architecture of Early- 
bird, the in-memory retrieval engine that powers Twitter's 
real-time search. The design takes advantage of the fact 
that tweets are very short and incorporates a number of op- 
timizations that do not work in the general case — it cannot 
handle arbitrary collections, as we do. Earlybird adopts a 
federated architecture, where each server holds a dozen sep- 
arate index segments, only one of which (the "active" seg- 
ment) ingests new tweets. In the active segment, postings 
are not compressed, which simplifies the indexing algorithm. 
In contrast, we build a single monolithic index and compress 
postings on the fly, representing a different point in the de- 
sign space of possible in-memory architectures. 

3. APPROACH 

3.1 Basic Incremental Indexing Algorithm 

Our indexer consists of three main components, depicted
in Figure 1: the dictionary, buffer maps, and the segment
pool. The basic indexing approach is to accumulate postings
in the buffer maps in an uncompressed form until the buffer
fills up, and then to "flush" the contents to the segment pool,
where the final compressed postings lists reside. Note that in
this approach the inverted lists are discontiguous; we return
to address this issue in Section 3.2.

The dictionary is implemented as a hash table with a bitwise
hash function [28] and the move-to-front technique [34],
mapping terms (strings) to integer term ids (see [37] for a
study that compares this to other approaches). There is 
nothing noteworthy about our dictionary implementation, 
and we claim no novelty in this design. The dictionary ad- 
ditionally holds the document frequency (df) for each term, 
as well as a head and tail pointer into the segment pool 
(more details below). In our implementation, term ids are 
assigned sequentially as we encounter new terms. 

A buffer map is a one-to-one mapping from term ids to arrays
of integers (the buffers). Since term ids increase monotonically,
a buffer map can be implemented as an array of
pointers, where each index position corresponds to a term id, 
and the pointer points to the associated buffer. The array 
of pointers is dynamically expanded to accommodate more 
terms as needed. To construct a positional index, we build 
three buffer maps: the document id (docid) map, the term 
frequency (tf) map, and the term positions map. As the 
names suggest, the docid map accumulates the document 
ids of arriving documents, the tf map holds term frequen- 
cies, and the term positions map holds term positions. There 
is a one-to-one correspondence between entries in the docid 
map and entries in the tf map (for each term that occurs in 
a document, there is exactly one term frequency), but a one- 
to-many correspondence between entries in the docid map 
and entries in the term positions map (there are as many 
term positions in each document as the term frequency). 
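
A buffer map along these lines can be sketched as follows. This is a simplified illustration rather than the actual Zambezi data structure: the type `buffer_map_t`, the function names, and the doubling policy for the pointer array are assumptions. A positional index would instantiate three such maps (docids, term frequencies, and term positions).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical buffer map: an array of pointers indexed by term id. */
typedef struct {
    uint32_t **buffers;  /* buffers[t] is the buffer for term id t, or NULL */
    size_t num_slots;    /* number of term id slots currently allocated */
} buffer_map_t;

/* Returns 0 on success, -1 on allocation failure. */
int buffer_map_init(buffer_map_t *map, size_t initial_slots) {
    assert(map != NULL && initial_slots > 0);
    map->buffers = malloc(initial_slots * sizeof(uint32_t *));
    if (map->buffers == NULL) {
        map->num_slots = 0;
        return -1;
    }
    for (size_t i = 0; i < initial_slots; i++) map->buffers[i] = NULL;
    map->num_slots = initial_slots;
    return 0;
}

/* Returns the buffer for term_id, expanding the pointer array and
 * allocating a zeroed buffer of buffer_len integers on first use.
 * Returns NULL on allocation failure. */
uint32_t *buffer_map_get(buffer_map_t *map, uint32_t term_id, size_t buffer_len) {
    assert(map != NULL && map->buffers != NULL && buffer_len > 0);
    if (term_id >= map->num_slots) {
        size_t new_slots = map->num_slots;
        while (new_slots <= term_id) new_slots *= 2;
        uint32_t **grown = realloc(map->buffers, new_slots * sizeof(uint32_t *));
        if (grown == NULL) return NULL;
        /* Newly added slots start out without a buffer. */
        for (size_t i = map->num_slots; i < new_slots; i++) grown[i] = NULL;
        map->buffers = grown;
        map->num_slots = new_slots;
    }
    if (map->buffers[term_id] == NULL) {
        map->buffers[term_id] = calloc(buffer_len, sizeof(uint32_t));
    }
    return map->buffers[term_id];
}

void buffer_map_free(buffer_map_t *map) {
    assert(map != NULL);
    for (size_t i = 0; i < map->num_slots; i++) free(map->buffers[i]);
    free(map->buffers);
    map->buffers = NULL;
    map->num_slots = 0;
}
```

Note that expanding the array of pointers moves only the pointers; the buffers themselves stay where they are, so previously returned buffer addresses remain valid.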

In the indexing loop, the algorithm receives an input doc- 
ument, parses it to gather all term frequencies and term po- 
sitions (relative to the current document, starting from one) 
for all unique terms, and then iterates over these unique 
terms, inserting the relevant information into each buffer 
map. Whenever we encounter a new term, the algorithm 
initializes an empty buffer in each buffer map for the corresponding
term id. Initially, the buffer size is set to the
block size b that will eventually be used to compress the
data (leaving aside an optimization we introduce below to
control the vocabulary size). Following best practice today,
we use PForDelta [40, 36], with the recommended block size
of b = 128. The term positions map expands one block at a
time when it fills up to accommodate more positions. When 
the docid buffer for a term fills up, we "flush" all buffers 
associated with the term, compressing the docids, term fre- 
quencies, and term positions into what we call an inverted 
list segment, described below: 

Each inverted list segment begins with a run of docids, 
gap-compressed using PForDelta; call this D. By design, the 
docids occupy exactly one PForDelta block. Next, we com- 
press the term frequencies using PFor; call this F. Note that 
term frequencies cannot be gap-compressed, so they are left 
unmodified. Finally, we process the term positions, which 
are also gap-encoded, relative to the first term position in 
each document. For example, if in $d_1$ the term was found at
positions [1, 5, 9] and in $d_2$ the term was found at positions
[3, 16], we would code [1, 4, 4, 3, 13]. The term positions can 
be unambiguously reconstructed from the term frequencies, 
which provide offsets into the array of term positions. Since
the term positions array is likely longer than b, the compression
block size, the term positions occupy multiple blocks.
Call the blocks of term positions $P_1 \ldots P_m$.

Figure 1: A snapshot of our indexing algorithm. In the middle we have buffer maps for storing docids, tfs,
and term positions: the gray areas show elements inserted for document 508, the current one to be indexed.
Once the buffer for a term fills up, an inverted list segment is assembled and added to the end of the
segment pool and linked to the previous segment via addressing pointers. The dictionary maps from terms
to term ids and holds pointers to the head and tail of the inverted list segments in the segment pool.
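
The per-document gap encoding of term positions can be sketched in C as follows. The function names `encode_position_gaps` and `decode_position_gaps` are hypothetical, and the sketch covers only the gap transformation that precedes block compression, not PForDelta itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gap-encodes term positions document by document.
 * tf[d] is the term frequency in the d-th buffered document, and
 * positions holds all positions back to back (sum of tf[] values).
 * Gaps restart in every document, so the first position of each
 * document is stored unchanged. Returns the number of values written. */
size_t encode_position_gaps(const uint32_t *tf, size_t num_docs,
                            const uint32_t *positions, uint32_t *out) {
    assert(tf != NULL && positions != NULL && out != NULL);
    size_t k = 0;
    for (size_t d = 0; d < num_docs; d++) {
        uint32_t prev = 0;
        for (uint32_t j = 0; j < tf[d]; j++, k++) {
            out[k] = positions[k] - prev;
            prev = positions[k];
        }
    }
    return k;
}

/* Inverse transformation: the term frequencies tell the decoder where
 * each document's run of positions starts. */
size_t decode_position_gaps(const uint32_t *tf, size_t num_docs,
                            const uint32_t *gaps, uint32_t *out) {
    assert(tf != NULL && gaps != NULL && out != NULL);
    size_t k = 0;
    for (size_t d = 0; d < num_docs; d++) {
        uint32_t pos = 0;
        for (uint32_t j = 0; j < tf[d]; j++, k++) {
            pos += gaps[k];
            out[k] = pos;
        }
    }
    return k;
}
```

With term frequencies [3, 2] and positions [1, 5, 9, 3, 16], the encoder produces [1, 4, 4, 3, 13], matching the example in the text.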

Finally, all the data are packed together in a contiguous 
block of memory as follows: 

$$[\,|D|,\ D,\ |F|,\ F,\ \{|P_i|,\ P_i\}_{1 \le i \le m}\,]$$

where the $|\cdot|$ operator returns the length of its argument.
Since all the data are tightly packed in an otherwise unde- 
limited array, we need to explicitly store the lengths of each 
block to properly decode the data during retrieval. 
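
The packing step can be illustrated with a short C sketch. Here each compressed block is treated as an opaque run of 32-bit words (the compression itself is out of scope), and the names `block_t`, `segment_size`, and `pack_segment` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* An already-compressed block, treated as an opaque run of words. */
typedef struct {
    const uint32_t *words;
    uint32_t length;  /* number of words, i.e. |D|, |F| or |P_i| */
} block_t;

/* Number of words needed for [|D|, D, |F|, F, {|P_i|, P_i}]. */
size_t segment_size(block_t d, block_t f, const block_t *p, size_t m) {
    size_t total = 2 + (size_t)d.length + (size_t)f.length;
    for (size_t i = 0; i < m; i++) total += 1 + (size_t)p[i].length;
    return total;
}

/* Writes one length-prefixed block at out[at]; returns the next offset. */
static size_t put_block(uint32_t *out, size_t at, block_t blk) {
    out[at++] = blk.length;
    if (blk.length > 0) {
        memcpy(out + at, blk.words, blk.length * sizeof(uint32_t));
    }
    return at + blk.length;
}

/* Packs the docid block, the tf block, and the m position blocks
 * back to back. out must hold segment_size(...) words.
 * Returns the number of words written. */
size_t pack_segment(uint32_t *out, block_t d, block_t f,
                    const block_t *p, size_t m) {
    assert(out != NULL);
    size_t at = put_block(out, 0, d);
    at = put_block(out, at, f);
    for (size_t i = 0; i < m; i++) at = put_block(out, at, p[i]);
    return at;
}
```

Because every block is preceded by its length, a decoder can walk the segment from left to right without any delimiters.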

Each inverted list segment is written at the end of the 
segment pool, which is where the compressed inverted index 
ultimately resides. Conceptually, the segment pool is an
unbounded array with a pointer that keeps track of the current
"end", but in practice the pool is allocated in large blocks 
and dynamically expanded as necessary. In order to traverse 
a term's postings during query evaluation, we need to "link" 
together the discontiguous segments. The first time we write 
a segment for a term id, we add its address (byte offset in the 
segment pool) to the dictionary, which serves as the "head" 
pointer (the entry point to postings traversal). In addition, 
we prepend to each segment the address (byte offset position 
in the segment pool) of the next segment in the chain. This 
means that every time we insert a new segment for a term, 
we have to go back and correct the "next pointer" for the 
last segment. We leave the next pointer blank for a newly- 
inserted segment to mark the end of the postings list for 
a term; this location is stored in the "tail pointer" in the 
dictionary. Once the indexer has processed all documents, 
the remaining contents of the buffer maps are flushed to the 
segment pool in the same manner. By default, we build full 
positional indexes, but our implementation has an option to 
disable the term position buffers if desired. In this case, the 
inverted list segments will be smaller, but other aspects of 
the algorithm remain exactly the same. 
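
The linking scheme can be sketched as follows. This is a simplified illustration, not the Zambezi implementation: it uses 32-bit word offsets rather than byte offsets, treats the segment payload as opaque, and the names `segment_pool_t`, `term_entry_t`, `pool_append`, and `count_segments` are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define END_OF_LIST UINT32_MAX  /* "blank" next pointer */

typedef struct {
    uint32_t *words;   /* pool storage */
    size_t used;       /* offset of the current "end" of the pool */
    size_t capacity;   /* number of words allocated */
} segment_pool_t;

/* Per-term dictionary fields; both start out as END_OF_LIST. */
typedef struct {
    uint32_t head;  /* offset of the term's first segment */
    uint32_t tail;  /* offset of the term's last segment */
} term_entry_t;

int pool_init(segment_pool_t *pool, size_t capacity) {
    assert(pool != NULL && capacity > 0);
    pool->words = malloc(capacity * sizeof(uint32_t));
    pool->used = 0;
    pool->capacity = (pool->words != NULL) ? capacity : 0;
    return (pool->words != NULL) ? 0 : -1;
}

/* Appends [next pointer, payload] at the end of the pool and links it
 * to the term's previous segment. Returns 0 on success, -1 on failure. */
int pool_append(segment_pool_t *pool, term_entry_t *term,
                const uint32_t *payload, size_t n) {
    assert(pool != NULL && pool->words != NULL && term != NULL);
    size_t needed = pool->used + 1 + n;
    assert(needed < END_OF_LIST);
    if (needed > pool->capacity) {
        size_t grown_cap = pool->capacity;
        while (grown_cap < needed) grown_cap *= 2;
        uint32_t *grown = realloc(pool->words, grown_cap * sizeof(uint32_t));
        if (grown == NULL) return -1;
        pool->words = grown;
        pool->capacity = grown_cap;
    }
    uint32_t at = (uint32_t)pool->used;
    pool->words[at] = END_OF_LIST;  /* new segment ends the chain */
    if (n > 0) memcpy(pool->words + at + 1, payload, n * sizeof(uint32_t));
    pool->used = needed;
    if (term->head == END_OF_LIST) {
        term->head = at;                 /* first segment for this term */
    } else {
        pool->words[term->tail] = at;    /* patch previous next pointer */
    }
    term->tail = at;
    return 0;
}

/* Follows the chain from the head pointer and counts the segments. */
size_t count_segments(const segment_pool_t *pool, const term_entry_t *term) {
    size_t count = 0;
    for (uint32_t at = term->head; at != END_OF_LIST; at = pool->words[at]) {
        count++;
    }
    return count;
}
```

Storing offsets rather than raw addresses means the links remain valid when the pool is expanded and its storage moves.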

Conceptually speaking, the postings list for each term is 
a linked list of inverted list segments, where each of the segments
is laid out in discontiguous, monotonically-increasing
byte offset positions in the segment pool and linked together
with addressing pointers. Segments corresponding to differ- 
ent terms are arbitrarily interleaved in the segment pool. 
What are the implications of this design? On the posi- 
tive side, all data in the segment pool are "tightly packed" 
for maximum efficiency in memory utilization: there are no 
empty regions and there is no need for special delimiters. 
During indexing we guarantee that there is no heap frag- 
mentation, which may be a possibility if we simply used 
malloc to allocate space for each inverted list segment. On 
the negative side, postings traversal becomes an exercise in 
pointer chasing across the heap, without any predictable ac- 
cess patterns that will aid in processor pre-fetching across 
segment boundaries. Thus, as a query evaluation algorithm 
consumes postings, it is likely to encounter a cache miss 
whenever it reaches the end of a segment, since it has to 
follow a pointer. On the other hand, it is not entirely clear 
if this cache miss is a major concern: since PForDelta is 
block-based, postings are decompressed in blocks even if the 
inverted lists are contiguously stored in memory. 

In addition to "flushing to memory" (i.e., the segment 
pool) as opposed to flushing to disk, the operation of our 
indexer is fundamentally different from previous designs. 
In previous approaches, the in-memory buffer is completely 
flushed when the capacity limit is reached, which means that 
inverted lists associated with all terms are written to disk. 
In contrast, we only flush data associated with the term id 
whose buffer has reached capacity.

One final optimization detail: we control the size of the 
term space by discarding terms that occur fewer than ten 
times (an adjustable document frequency threshold). This 
is accomplished as follows: instead of creating a buffer of 
length b when we first encounter a new term, we first allocate 
a small buffer equal to the df threshold. We buffer postings 
for new terms until the threshold is reached, after which we 
know that the term will make it into the final dictionary, and 
so we reallocate a buffer of length b. This two-step process 
reduces memory usage substantially since there are many 
rare terms in web collections. 



3.2 Segment Contiguity 

It is clear that our baseline indexing algorithm generates 
discontiguous inverted list segments. In order to create con- 
tiguous inverted lists, we would need an algorithm to rear- 
range the segments once they are written to the segment 
pool. Following the "remerging" idea in disk-based incre- 
mental indexing, we might merge multiple discontiguous seg- 
ments belonging to the same term id and transfer them to 
another region in memory, repeating if necessary. Alter- 
natively, when writing an inverted segment to the segment 
pool, we might leave some empty space — but since no pre- 
allocation policy can be prescient, we will either leave too 
much empty space (wasting memory) or not leave enough 
(necessitating further copying). These basic designs have 
been explored in the context of on-disk incremental indexing
(see Section 2.2), but we argue that the issues become
more complex in memory because we do not have an in- 
termediate abstraction of the file — the indexing algorithm 
must explicitly manage memory addresses. This amounts 
to implementing malloc and free for inverted list segments, 
which is a non-trivial task. 

Before going down this path, however, we first examined 
the extent to which contiguous segments would improve re- 
trieval efficiency, from better reference locality, pre-fetch 
cues provided to the processor, etc. Let us assume we have 
an oracle that tells us exactly how long each inverted list is 
going to be, so that we can lay out the segments end-to-end, 
without any wasted memory. We simulate this oracle con- 
dition by building the inverted index as normal, and then 
performing in-memory post-processing to lay out all the inverted
lists contiguously. Obviously, in a real incremental indexing
scenario, this is not a workable option, but this simple
experiment allows us to measure the ideal performance from 
the perspective of query evaluation. Thus, we can establish 
two retrieval efficiency bounds — the query evaluation time 
on arbitrarily discontiguous inverted lists (the baseline al- 
gorithm) and on contiguous inverted lists (the upper bound 
on query evaluation speed). 

Using these two efficiency bounds as guides, we developed 
a simple yet effective approach to achieving increasingly bet- 
ter approximations of contiguous postings lists. Instead of 
moving compressed segments around after they have been 
added to the segment pool, we change the memory alloca- 
tion policy for the buffer maps. In the limit, if we increased 
buffer map sizes so that they are large enough to hold the 
entire document collection in uncompressed form, it is easy 
to see how we could build contiguous inverted list segments. 
As it turns out, we do not need to go to such extremes. 

In our strategy, whenever the docid buffer for a term be- 
comes full (and thus compressed and flushed to the segment 
pool), we expand that term's docid and tf buffers by a fac- 
tor of two (still allowing the term positions buffer to grow as 
long as necessary). This means that after the first segment 
of a term is flushed, new docid and tf buffers of length 2b 
replace the old ones; after the second flush, the buffer size 
increases to 4b, and then 8b, and so on. When a buffer of size 
$2^m b$ becomes full, the buffer is broken down into $2^m$ segments,
each segment is compressed as described earlier, and all $2^m$
segments are written at the end of the segment pool con- 
tiguously. This strategy allows long postings to become in- 
creasingly contiguous, without wasting space to pre-allocate 
large buffers to hold terms that turn out to be rare. 

To prevent buffers from growing indefinitely and to control
memory pressure, we set a cap on the length of
docid and tf buffers. That is, if the cap is set to $2^m b$,
then when the buffer size for a term reaches that limit, it is
no longer expanded. This means that the maximum number
of contiguous segments allowed in the segment pool is
$2^m$. We experimentally show that for relatively small values
of m, around 6 or 7, we achieve query evaluation speeds
that are statistically indistinguishable from having an index 
with fully-contiguous inverted lists (i.e., the oracle condi- 
tion). The tradeoff of this approach is that we require more 
transient working memory during the indexing process, and 
that impacts the size of the collection that we can index. 
However, we experimentally show that the additional mem- 
ory requirements for implementing this approach are reason- 
able. Note that for on-disk incremental indexing algorithms, 
the strategy of increasing the in-memory buffer size is gen- 
erally not considered since those algorithms operate under 
an assumption of limited memory. In our case, we are sim- 
ply changing the allocation between transient working mem- 
ory for performing document inversion and the final index 
structures. In Section 6, we consider alternative designs and
discuss why we settled on this approach. 
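
The buffer growth schedule, including the cap and the small initial buffer used for the document frequency threshold in Section 3.1, can be summarized in a short C sketch. The function names `next_buffer_len` and `segments_per_flush` are hypothetical, and the sketch captures only the sizing arithmetic, not the flush itself.

```c
#include <assert.h>
#include <stddef.h>

/* Length a term's docid/tf buffer should have once its current buffer
 * of `current` entries has filled up, for block size b and cap 2^m * b.
 * A buffer shorter than b is the small df-threshold buffer, which is
 * promoted to a full block; otherwise the length doubles up to the cap. */
size_t next_buffer_len(size_t current, size_t b, unsigned m) {
    assert(b > 0 && current > 0 && m < 32);
    size_t cap = b << m;
    if (current < b) return b;
    if (current >= cap) return cap;
    size_t doubled = current * 2;
    return (doubled < cap) ? doubled : cap;
}

/* Number of b-sized segments written contiguously to the segment pool
 * when a full buffer of buffer_len entries is flushed. */
size_t segments_per_flush(size_t buffer_len, size_t b) {
    assert(b > 0);
    return buffer_len / b;
}
```

With b = 128 and m = 6, for example, the schedule is 128, 256, 512, and so on up to 8192 entries, at which point every flush writes 64 contiguous segments.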

4. EXPERIMENTAL SETUP 

We performed experiments on two standard collections: 
Gov2 and ClueWeb09. The Gov2 collection is a crawl of 
.gov sites from early 2004, containing about 25 million pages, 
totaling 81GB compressed. ClueWeb09 is a best-first web 
crawl from early 2009. Our experiments used only the first 
English segment, which has 50 million documents (247GB 
compressed). For evaluation purposes, we used two sets of 
queries: the TREC 2005 terabyte track "efficiency" queries, 
which consist of 50,000 queries total; and a set of 100,000 
queries sampled randomly from the AOL query log [27] . 

Our indexer, called Zambezi, is implemented in C; it is 
currently single-threaded. To support the reproducibility of 
experiments described in this paper, the system is released 
under an open source license.¹ Since this paper is focused
on indexing, we wished to separate document parsing from 
the actual indexing. Therefore, we assumed that input test 
collections are already parsed, stemmed, with stopwords re- 
moved before indexing. Our reports of indexing speed do 
not include document pre-processing time. 

Experiments were performed on a server running Red Hat 
Linux, with dual Intel Xeon "Westmere" quad-core processors
(E5620 2.4GHz) and 128GB RAM. This particular architecture
has a 64KB L1 cache per core, split between data
and instructions; a 256KB L2 cache per core; and a 12MB 
L3 cache shared by all cores of a single processor. 

We examined three aspects of performance: memory us- 
age, indexing speed, and query evaluation latency. The first 
two are straightforward, but we elaborate on the third. For 
each indexer configuration, we measured query evaluation 
speed in terms of query latency for two retrieval strategies: 
conjunctive retrieval using the SvS algorithm, demonstrated 
by Culpepper and Moffat [8] to be the best approach to post- 
ings intersection, and disjunctive query processing using the 
Wand algorithm [4], which represents a strong baseline for 
top k retrieval (with BM25). Note that for both cases we 
first indexed the collection, and then performed query eval- 
uation at the end — the interleaving of indexing and retrieval 

1 http://zambezi.cc/
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
49.7 (±0.2)  47.1 (±0.1)  45.9 (±0.4)  44.4 (±0.5)  42.9 (±0.4)  42.0 (±0.3)  41.6 (±0.1)  41.6 (±0.4)  41.3 (±0.1)
87.5 (±1.6)  83.2 (±0.5)  80.7 (±0.3)  75.5 (±0.5)  75.7 (±0.8)  75.8 (±0.3)  75.2 (±0.2)  75.0 (±0.6)  75.3 (±1.2)

Table 1: Average query latency (in milliseconds) for postings intersection using SvS with different buffer 
length settings. Results are averaged across 5 trials, reported with 95% confidence intervals. 



operations is beyond the scope of this work, since it involves 
tackling a host of concurrency challenges. 

The SvS algorithm sorts postings lists in increasing order
of length. It begins by intersecting the two smallest
lists. At each step, the algorithm intersects the current
intersection set with the next postings list, until all lists are
consumed. Each intersection is carried out using a one-sided
binary search, or "galloping" search. Note that with SvS we
compute the entire intersection set.
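
The procedure just described can be sketched in a few lines of Python. This is a minimal illustration over plain sorted lists of docids, not the Zambezi implementation, which operates on compressed segments; the function names gallop, intersect, and svs are hypothetical.

```python
from bisect import bisect_left


def gallop(postings, target, lo):
    """Return the smallest index i >= lo with postings[i] >= target.

    One-sided binary search: double the step until the target is
    bracketed, then binary-search inside the bracketed range.
    """
    n = len(postings)
    if lo >= n or postings[lo] >= target:
        return lo
    step = 1
    hi = lo + step
    while hi < n and postings[hi] < target:
        lo = hi              # postings[lo] < target still holds
        step *= 2
        hi = lo + step
    return bisect_left(postings, target, lo + 1, min(hi + 1, n))


def intersect(short, longer):
    """Intersect two sorted docid lists, galloping through the longer one."""
    out, pos = [], 0
    for docid in short:
        pos = gallop(longer, docid, pos)
        if pos == len(longer):
            break
        if longer[pos] == docid:
            out.append(docid)
    return out


def svs(postings_lists):
    """Intersect all lists, processing them in increasing order of length."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = list(ordered[0])
    for plist in ordered[1:]:
        result = intersect(result, plist)
        if not result:
            break
    return result


lists = [[1, 3, 5, 7, 9, 11], [3, 4, 5, 9], [0, 3, 9, 10, 12, 15, 20]]
common = svs(lists)
```

Because the running intersection can only shrink, starting from the shortest lists keeps the number of galloping probes into the longer lists small.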

The Wand algorithm uses a pivot-based pointer-movement 
strategy which enables the algorithm to skip over postings 
of documents that cannot possibly be in the top k results by 
reasoning about the maximum score contribution for each 
term. Recently, Ding and Suel [11] introduced an additional
optimization called Block-Max Wand (BMW) that increases
query evaluation speed. The idea is that instead of using
the global maximum score of each term to compute the pivots,
the algorithm uses a piecewise upper-bound approximation
of the scores. However, this algorithm is not directly
applicable for incremental indexing: in order to compute a 
score upper-bound for each block, the indexer needs accu- 
rate global statistics (such as average document length in the 
case of BM25). Thus, there are only two options: either use 
statistics at the time the block is written, which might com- 
promise correctness, or continually update the upper bounds 
whenever the statistics change, which is slow. 
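
To make the two ideas in this paragraph concrete, the following Python sketch shows (a) how a Wand-style pivot can be chosen from per-term score upper bounds, and (b) why a per-block upper bound frozen at write time can become invalid once global statistics change. The cursor layout, the function names, and the BM25 parameter values are assumptions made for illustration; the BM25 formula is one common variant and not necessarily the one used in our experiments.

```python
def find_pivot(cursors, threshold):
    """Pick the pivot docid for one Wand step.

    cursors holds one (current_docid, max_score) pair per query term,
    where max_score is an upper bound on that term's contribution.
    Returns the smallest docid that could still score above threshold,
    or None if even the sum of all upper bounds cannot beat it.
    """
    upper_bound = 0.0
    for docid, max_score in sorted(cursors):
        upper_bound += max_score
        if upper_bound > threshold:
            return docid
    return None


def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b_norm=0.75):
    """A common form of the BM25 term score (parameter values assumed)."""
    norm = k1 * (1.0 - b_norm + b_norm * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


cursors = [(40, 1.5), (12, 0.8), (25, 2.0)]
pivot = find_pivot(cursors, threshold=2.5)

# A bound computed when the block was written (average length 500) is
# exceeded by the same posting once the average length grows to 800.
bound_at_write = bm25_term_score(tf=3, doc_len=600, avg_doc_len=500, idf=2.0)
score_later = bm25_term_score(tf=3, doc_len=600, avg_doc_len=800, idf=2.0)
```

In the second part, the same posting scores higher after the average document length increases, so a stale block maximum would no longer be a safe upper bound for skipping.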

Since our focus is not on query evaluation, we believe that 
experiments with SvS and Wand are sufficient to illustrate 
the tradeoffs our indexing algorithm manifests. Note that 
we do not consider any learning to rank approach [20] because
it represents an orthogonal issue. In a modern multi-stage
web search architecture [2, 33], an initial retrieval stage
(e.g., using SvS or Wand) generates documents that are
then reranked by a machine-learned ranking model.

Finally, we compared our Zambezi indexer against two 
open-source retrieval engines: Zettair (v0.9.3), which implements
the geometric partitioning approach of Lester et
al. [19], and Indri (v5.1). To ensure a fair comparison with
the other systems, we disabled their document parsing phase 
and used the already parsed documents as input. As with 
our algorithms, reports of indexing speed do not include time 
spent on document pre-processing. 

5. RESULTS 

5.1 Query Latency 

Table 1 summarizes query latency for conjunctive query
processing (postings intersection with SvS). The average latency
per query is reported in milliseconds across five trials
along with 95% confidence intervals. Each column shows a different
indexing condition: 1b is the baseline algorithm pre-
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
150.0 (±0.5)  141.1 (±0.6)  141.1 (±0.2)
455.7 (±5.1)  434.3 (±5.8)  432.6 (±4.9)

Table 2: Average query latency (in milliseconds) to 
retrieve the top 1000 hits in terms of BM25 using 
WAND (5 trials, with 95% confidence intervals). 
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(linked list of inverted list segments). 
128}b represents a different upper bound
in the buffer map growing strategy described in Section 3.2.
The final column marked "contiguous" denotes the oracle 
condition in which all postings are contiguous; this repre- 
sents the ideal performance. 

From these results, we see that, as expected, discontiguous
postings lists (1b) yield slower query evaluation: on Gov2,
queries are approximately 10% slower, while for ClueWeb09,
the performance dropoff ranges from 16% to 20%. For higher
values of b, we allow the buffer maps to increase in length:
at 32b, query evaluation performance is statistically
indistinguishable from the performance upper bound (i.e.,
confidence intervals overlap). That is, we only need to arrange
inverted list segments in relatively small groups of 32 to
achieve ideal performance. Later, we quantify the memory
requirements of allocating larger buffer maps.

Figure 2 illustrates query latency by query length, for
the AOL query set on Gov2 and ClueWeb09, using different
conditions. Not surprisingly, the latency gap between contiguous
and the 1b condition widens for longer queries. On
the other hand, the contiguous index and the 32b condition
are indistinguishable across all query lengths: the lines
practically overlap in the figures.

For disjunctive query processing, we used the Wand algorithm
to retrieve the top 1000 documents using BM25.
Table 2 summarizes these experiments on different collections
and queries. For space considerations, we only report
results for select buffer length configurations. These results
are consistent with the conjunctive processing case. A maximum
buffer size of 32b yields query latencies that are statistically
indistinguishable from a contiguous index. Note that
the performance difference between fully-contiguous postings
lists and 1b discontiguous postings lists is less than 7%.
In other words, there is much less performance degradation
than in the SvS case. There is a good explanation for this
behavior in terms of memory access patterns, which we come
back to in Section 6.
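
The two comparisons made in this paragraph (overlap of confidence intervals and relative slowdown) can be checked directly against the values in Table 2. The Python sketch below assumes that the three columns of Table 2 correspond, in order, to the 1b, 32b, and contiguous conditions; that column order is inferred from the surrounding text rather than stated in the table, and the helper names are hypothetical.

```python
def overlaps(mean_a, half_a, mean_b, half_b):
    """True if the intervals mean_a ± half_a and mean_b ± half_b intersect."""
    return abs(mean_a - mean_b) <= half_a + half_b


def slowdown(latency, baseline):
    """Relative slowdown of latency with respect to baseline."""
    return (latency - baseline) / baseline


# (mean latency in ms, 95% CI half-width); column order is an assumption.
table2 = [
    {"1b": (150.0, 0.5), "32b": (141.1, 0.6), "contiguous": (141.1, 0.2)},
    {"1b": (455.7, 5.1), "32b": (434.3, 5.8), "contiguous": (432.6, 4.9)},
]

for row in table2:
    same_as_ideal = overlaps(*row["32b"], *row["contiguous"])
    penalty = slowdown(row["1b"][0], row["contiguous"][0])
    print(f"32b overlaps contiguous: {same_as_ideal}, "
          f"1b slowdown: {penalty:.1%}")
```

Under that assumption, both rows show overlapping intervals for 32b and contiguous, and a 1b slowdown below 7%.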

As with the conjunctive query processing case, we ana- 
lyzed query latency by length. The results, however, were 
not particularly insightful: as expected, query latency in- 
creases with length, and the performance differences be- 
tween the three conditions were so small that the plots es- 
sentially overlapped. For this reason, we did not include the 
corresponding figures here. 
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Figure 2: Query latency using SvS for the AOL query set, by query length for different buffer length settings. 
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Figure 3: Indexing speed for Indri and Zettair with different memory limits, and Zambezi (our indexer) with 
different contiguity conditions on Gov2 and ClueWeb09. Error bars show 95% conf. intervals across 3 trials. 



5.2 Indexing Speed 

Figure 3 shows indexing times for our indexer, Zettair,
and Indri. For Zettair and Indri, we varied the amount of 
memory provided to the system. Note that we were not able 
to provide Zettair with more than 4GB memory due to its 
implementation. In C, the maximum size of an individual 
array is 2^32, and circumventing this restriction would have
required substantial refactoring of the code, which we did 
not undertake. For our indexer, we report results with the 
different postings list contiguity conditions. Error bars show 
95% confidence intervals across 3 trials. In all conditions we 
do not include document pre-processing time. 

Indexing time with Indri appears to be relatively insen- 
sitive to the amount of memory provided, but it is overall 
slower than both Zettair and our indexer. However, the per- 
formance differences are more pronounced for Gov2 than for 
ClueWeb09. With Zettair, the maximum size of the memory 
buffer does have a significant impact on indexing time. Ironically,
giving Zettair more memory actually slows down
indexing! We explain this counter-intuitive result as follows:
smaller in-memory segments are more cache-friendly;
for example, our system has a 12MB L3 cache, so in the 
20MB condition, more than half of the segment will reside 
in cache. On the other hand, smaller segments require more 
merging. In general, it seems like the first factor is more 
important: for ClueWeb09, indexing is fastest with 20MB 
buffers. For Gov2, increased cache performance is not sufficient
to compensate for additional time spent merging, but
the optimal balance occurs with 128MB buffers, which is
still relatively small.⁴

These results show that our in-memory indexing algo- 
rithm is not substantially faster (and for some conditions 
on Gov2, actually slower) than an on-disk algorithm. Why 
might this be so? First, on-disk indexing algorithms have 
been studied for decades, and so it is no surprise that state- 
of-the-art techniques are well-tuned to the characteristics of 
disks. Second, cache locality effects and memory latencies 
slow down in-memory algorithms as the memory footprint 
increases — this is confirmed by the Zettair results, where, 
in general, giving the indexer more memory reduces perfor- 
mance. How does this happen? A larger in-memory foot- 
print means that we are accumulating more documents in 
the buffer, and hence managing a larger vocabulary space. 
This causes more "cache churn", since whenever we encounter 
a rare term, its associated data (e.g., recently-inserted post- 
ings) are fetched into cache, displacing another term's. Since 
the rare term is unlikely to appear in another document 
soon, it is wasting valuable space in the cache. In contrast, 
the merging operations for the on-disk algorithms access 
data in a very predictable pattern, thus creating opportuni- 
ties for the pre-fetchers to mask memory latencies. To test 
this hypothesis, we ran Indri with a memory size of 120GB 
on Gov2, and the indexer took 38k seconds to complete, 
which is roughly double the times reported in Figure 3. This
result appears to support our analysis. 

4 Lester (p.c.) concurs with our explanation. 
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Figure 4: Memory required to hold all buffer maps 
for different buffer length settings, normalized to the 
1b setting, on Gov2 and ClueWeb09.

Finally, we note that end-to-end system comparisons con- 
flate several components of indexers that may have nothing 
to do with the algorithms being studied — for example, we 
use PForDelta compression, whereas Zettair does not. The 
research question in our study, the impact of postings lists 
contiguity, is primarily addressed with experiments that consider
different contiguity configurations. However, Zettair
and Indri results provide external points of reference to
contextualize our findings.

5.3 Memory Usage 

All inverted indexing algorithms require transient working 
memory to hold intermediate data structures. For on-disk 
incremental indexing algorithms, previous work has assumed 
that this working memory is relatively small. In our case, 
there is no hard limit on the amount of space we can devote 
to working memory, but space allocated for holding inter- 
mediate data takes away from space that can be used to 
store the final compressed postings lists, which limits the 
size of the collection that we can index for a fixed server 
configuration. 

At minimum, our buffer maps must hold the most recent
b docids, term frequencies, and associated term positions
(leaving aside the rare terms optimization described earlier). In
our case, we set b = 128 to match the best-practice PForDelta
block size; any smaller value would compromise decompression
performance. In order to increase the contiguity of the
inverted list segments, we increase the length of the buffers,
as described in Section 3.2. This of course increases the
space requirements of the buffer maps.

Figure 4 shows the maximum size of the buffer maps for
different contiguity configurations, broken down by space devoted
to docids, term frequencies, and term positions. The
reported values were computed analytically from the necessary
term statistics, making the assumption that all terms
reach their maximum buffer size at the same time, which
makes these upper bounds on memory usage. To facilitate
comparison across the two collections, we normalized the
values to the 1b condition; in absolute terms, the total buffer
map sizes are 12.6GB for Gov2 and 22.1GB for ClueWeb09.
It is no surprise that as the maximum buffer length increases,
the total memory requirement grows as well. At 128b, where
we allow the buffer to grow to 128 blocks of 128 32-bit integers,
the algorithm requires 71% more space for Gov2 and
95% more space for ClueWeb09 (compared to the 1b condition).
At 32b, which from our previous results achieves
query evaluation performance that is statistically
indistinguishable from contiguous postings lists, we require 44% and
70% more memory for Gov2 and ClueWeb09, respectively.

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Figure 5: Percentage of terms for which a buffer of
length m × b is required, for different values of m,
and block size b = 128.

As reference, the total size of the segment pool (i.e., size of 
the final index) is 31GB for Gov2 and 62GB for ClueWeb09. 
This means, on the Gov2 collection, setting the maximum 
buffer length to 1b, 32b, and 128b results in a buffer map that
is approximately 41%, 59%, and 69% of the overall size of 
the segment pool, respectively. Similarly, for ClueWeb09, 
the buffer map sizes are approximately 32%, 54%, and 63% 
of the size of the segment pool, respectively. These statistics 
quantify the overhead of our in-memory indexing algorithms. 
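
As a worked check of the Gov2 figures, the short Python sketch below combines the absolute buffer map size at the 1b setting, the reported relative increases, and the segment pool size. All inputs are the rounded values quoted in the text, so the computed fractions agree with the reported percentages only to within rounding; the constant and function names are hypothetical.

```python
GOV2_BUFFER_MAPS_GB = 12.6    # total buffer map size at the 1b setting
GOV2_SEGMENT_POOL_GB = 31.0   # size of the final index
EXTRA_SPACE = {"1b": 0.00, "32b": 0.44, "128b": 0.71}  # relative to 1b


def buffer_map_gb(setting):
    """Upper bound on buffer map size for a given contiguity setting."""
    return GOV2_BUFFER_MAPS_GB * (1.0 + EXTRA_SPACE[setting])


def fraction_of_segment_pool(setting):
    """Buffer map size as a fraction of the final index size."""
    return buffer_map_gb(setting) / GOV2_SEGMENT_POOL_GB


for setting in EXTRA_SPACE:
    print(f"{setting:>4}: {buffer_map_gb(setting):5.1f} GB, "
          f"{fraction_of_segment_pool(setting):.1%} of the segment pool")
```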

Note that most of the working memory is taken up by term 
positions; in comparison, the requirements for buffering docids
and term frequencies are relatively modest. In all cases the
present implementation uses 32-bit integers, even for term 
positions. We could easily cut the memory requirements for 
those in half by switching to 16-bit integers, although this 
would require us to either discard or arbitrarily truncate 
long documents. Ultimately, we decided not to sacrifice the 
ability to index long documents. 

The total number of unique terms is 31M in Gov2 and 79M 
in ClueWeb09. Since these collections consist of web pages, 
most of the terms are unique and correspond to JavaScript 
fragments that our parser inadvertently included and other 
HTML idiosyncrasies; such issues are prevalent in web search 
and HTML cleanup is beyond the scope of this paper. Our 
indexer discards terms that occur fewer than 10 times, which 
results in a vocabulary size of 2.9M for Gov2 and 6.9M for 
ClueWeb09. Of these, Figure 5 shows the percentage of
terms that require a maximum buffer length of m × b, for
different values of m in our contiguity settings. For example,
the 1b bar represents terms whose document frequencies are
≥ 10 but < 128. The 2b bar represents terms whose document
frequencies are ≥ 128 but less than 1b + 2b = 384,
and so on. The 128b bar represents terms whose document
frequencies exceed the maximum buffer length of 128 blocks.
From this we can see why significantly increasing the b value
only yields a modest increase in memory requirements.
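
The binning behind Figure 5 can be sketched as a small Python function that maps a term's document frequency to the buffer length it ends up requiring. The sketch assumes that buffers grow by doubling (1b, 2b, 4b, ...) up to the configured maximum and that a buffer grows each time it fills, so the thresholds are cumulative (128, 384, 896, ...). Only the first two thresholds are stated explicitly above; the rest are extrapolated from "and so on", and the function name is hypothetical.

```python
B = 128  # block size b


def required_buffer_blocks(doc_freq, max_blocks=128, b=B):
    """Return m such that the term's buffer reaches length m x b.

    capacity tracks how many postings have been absorbed by the time a
    buffer of length m x b fills up, under the assumed doubling schedule.
    """
    m, capacity = 1, b
    while doc_freq >= capacity and m < max_blocks:
        m *= 2
        capacity += m * b
    return m


for df in (10, 127, 128, 383, 384, 20000):
    print(f"df={df:>6} -> {required_buffer_blocks(df)}b")
```

Since most vocabulary terms have low document frequencies, they fall into the smallest bins and never allocate the larger buffers.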

Finally, the average size of each inverted list segment for
terms with a buffer length of 1b is about 300 bytes; for terms
that require a buffer of length 2b, the average size is
around 600 bytes. For terms with a buffer of length > 2b,
this value is about 800 bytes. These statistics make sense
since 1b terms have a document frequency of less than
128, and in general, rarer terms have smaller term frequencies,
and hence fewer term positions.

6. DISCUSSION 

Let us summarize our findings so far: compared to "ideal" 
contiguous postings lists, a linked list of inverted list seg- 
ments yields slower query evaluation. However, if the algo- 
rithm creates groups of 32 inverted list segments by buffer- 
ing, we can achieve performance that is statistically indistin- 
guishable from ideal performance. Thus, postings list conti- 
guity is important but only up to a point. 

From the processor architecture perspective, there are two 
interacting phenomena that contribute to this result: First, 
the memory latencies associated with pointer chasing in the 
linked lists are masked by PForDelta decompression. With 
contiguous postings lists, predictable striding allows pre- 
fetching to hide memory latencies, but postings are traversed 
in "bursts" since after reading each segment the algorithm 
must decode the blocks. Thus, decompression can hide cache 
misses in the case of discontiguous postings: while the pro- 
cessor is decompressing one segment, it can dispatch mem- 
ory requests for the next (since the instructions are inde- 
pendent). Second, query evaluation is more complex than 
a simple linear scan of postings lists: SvS performs gallop- 
ing search for intersection and Wand uses pivoting to skip 
around in the postings lists. This behavior creates unpre- 
dictability in memory access patterns and reduces opportu- 
nities for the pre-fetchers to detect striding patterns. To 
illustrate this, consider the difference between ideal performance
and the 1b baseline condition: the performance gap
is much smaller for Wand than for SvS. This makes sense,
since at each stage, SvS is intersecting the current postings
list with the working set: this implies greater cache locality,
so we obtain a bigger performance boost with contiguous
postings lists. On the other hand, Wand pivots from term
to term and at each step may advance the current pointer
by an unpredictable amount; thus, even if the postings lists
are contiguous, the processor may encounter cache misses.
As a result, it makes less of a difference if the postings lists
are discontiguous to begin with.

The tradeoff in our approach is that our algorithm needs 
to devote working memory to buffering the relevant data, 
which takes away from space that can be devoted to the 
final compressed index — this limits the size of the collec- 
tion that we can handle. In practice, however, we do not 
believe this is an issue. In our experiments, 128GB of mem- 
ory is more than enough to handle 50 million documents 
(ClueWeb09). Most production systems adopt a partitioned 
architecture, where the entire document collection is split 
into smaller parts and assigned to individual servers [16].
The size of each partition is governed by many factors, one 
of which is query evaluation speed. In order to maintain 
constant query evaluation speed, the growth of the parti- 
tion size is limited by processor speed and memory latencies. 
However, the maximum possible partition size is dictated by 
the amount of memory available. Based on current trends, 
memory capacities are growing much faster than the speed of 
individual processor cores and improvements in memory la- 
tencies. Thus, the overhead required by our incremental in- 
dexing algorithm will become less and less of a concern over 
time. Even still, there are relatively simple optimizations 
that we can implement to significantly reduce the working
memory requirements. Currently, all values in our buffer
maps are 32-bit integers, but that is overkill for most cases. 
Buffered values can be stored in compressed form using stan- 
dard techniques such as variable-sized integers or Rice codes. 
This will especially reduce the space requirements for stor- 
ing term positions, whose values are generally small and can 
be gap-compressed on the fly. 
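
As an illustration of the kind of optimization we have in mind, the Python sketch below gap-compresses a list of term positions and stores each gap as a variable-length integer (7 payload bits per byte, with the high bit marking the final byte of each value). This is one common variable-byte convention chosen for the example, not a description of the current implementation, and the function names are hypothetical.

```python
def vbyte_encode(n):
    """Encode a non-negative integer, low-order 7-bit groups first."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)   # high bit marks the last byte of the value
    return bytes(out)


def encode_positions(positions):
    """Gap-compress increasing term positions, then variable-byte encode."""
    buf = bytearray()
    prev = 0
    for p in positions:
        buf += vbyte_encode(p - prev)
        prev = p
    return bytes(buf)


def decode_positions(data):
    """Invert encode_positions."""
    positions = []
    value = shift = prev = 0
    for byte in data:
        if byte & 0x80:                       # final byte of this gap
            value |= (byte & 0x7F) << shift
            prev += value
            positions.append(prev)
            value = shift = 0
        else:
            value |= byte << shift
            shift += 7
    return positions


positions = [3, 17, 18, 140, 9000]
encoded = encode_positions(positions)
print(len(encoded), "bytes instead of", 4 * len(positions))
```

Most gaps fit in a single byte, so the buffered positions occupy a fraction of the space needed by fixed 32-bit integers, at the cost of a decoding step when the buffer is flushed.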

As an alternative to increasing the size of the buffer maps 
to increase postings list contiguity, we could pre-allocate 
memory in the same manner as on-disk incremental algo- 
rithms (i.e., when flushing an inverted list segment to the 
segment pool, leave extra space). We had considered this 
approach, but rejected it for a number of reasons. First, 
reserved space in our setting would need to be in multiples 
of the average inverted list segment due to the block nature 
of PForDelta compression, so neither the constant nor pro- 
portional strategy of Tomasic et al. [32] will work. However, 
since inverted list segments do not have fixed sizes, there is 
greater potential for waste: say, we only reserved 800 bytes, 
but the next inverted list segment occupies 801 bytes. Sec- 
ond, since no pre-allocation policy can be prescient, there 
will inevitably be fragmentation in the segment pool unless 
we start moving data around to eliminate memory gaps — 
at which point, we're basically writing a garbage collector 
(another non-trivial challenge). In contrast, our buffering 
approach does not create any empty space in the segment 
pool, and the additional memory requirements of the buffer 
maps are transient (i.e., freed after the indexing process is 
complete). For these reasons, we feel that our approach is 
superior and rejected the alternative. 

7. CONCLUSION AND FUTURE WORK 

One area of future work is to explore the interleaving of in- 
dexing and retrieval operations, which requires care to man- 
age concurrent access to global data structures. Since this 
paper focuses on indexing and not on query evaluation per 
se, we have set aside this complexity for now. However, 
we see at least two methodological issues that need to be 
addressed in such a study: first, we do not have a realistic 
model of document and query arrival, and second, we need 
new metrics to quantify query evaluation speed to factor 
out differences due to queries issued at different times over 
indexes of different sizes. 

In this paper, we have taken an initial foray into studying 
in-memory indexing algorithms, an underexplored region in
the design space. Our finding that postings list contiguity 
matters only to a certain extent contributes to our under- 
standing of information retrieval algorithms in the context 
of modern processor architectures. We believe that other 
aspects of information retrieval algorithms will also need to 
be reexamined, because the tradeoffs become very different 
once disk is removed from the picture. 
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